The Morbidity and Mortality Weekly Report (MMWR) series is published weekly, as the name implies, by the Centers for Disease Control and Prevention (CDC). The series is the CDC’s primary route for “scientific publication of timely, reliable, accurate, objective, and useful public health information and recommendations.” Specifically, the weekly series, the focus of this study, is based on weekly reports to the CDC from state public health departments; thus, each issue represents current trends in notifiable diseases, outbreaks, and conditions of public health concern across the United States.[1] It currently has a SCImago Journal Rank (SJR) score of 9.323, ranking 88th among 34,171 journals,[2] and a journal impact factor of 12.888, ranking second in the subject category ‘Public, Environmental & Occupational Health’ in the Science Citation Index Expanded edition of Journal Citation Reports.[3]
Given these frequent, reputable, and impactful publications, MMWR Weekly represents a robust longitudinal source of public health information, but digital editions currently exist only in an entirely unstructured format, in either HTML or PDF. However, with natural language processing (NLP) and machine learning (ML), one could process these publications to glean public health insights and trends for the United States.
For each issue published between 1993 and 2018, with the exclusion of 2015, we identified all uni-, bi-, and tri-grams using the RWeka package. Each of these n-grams was then passed to Amazon Web Services (AWS) Comprehend Medical, an NLP tool specifically trained to recognize medical entities in raw text. The RWeka preprocessing was necessary to stay within AWS’s ‘Free Tier.’ Medical entities returned include anatomy, medical conditions, medications, protected health information, and tests, treatments, & procedures. However, the entity categories of anatomy and protected health information were excluded. The resulting dataset includes 106,983 observations with the following columns & attributes:
##       id year quarter month issue                  text count main_score
## 1 666508 2017       4    11    47             ab pinaca    35  0.5914666
## 2 668140 2017       4    11    48             ab pinaca     3  0.3222043
## 3 666857 2017       4    11    47 ab pinaca mitragynine     5  0.3430346
## 4 112831 1999       1     2     6             abamectin     3  0.5634137
## 5 115335 1999       1     3    13              abattoir     4  0.3637859
## 6 116322 1999       2     4    16              abattoir     4  0.3637859
##     category         type    trait trait_score
## 1 medication generic_name     <NA>          NA
## 2 medication generic_name negation   0.4096546
## 3 medication generic_name     <NA>          NA
## 4 medication   brand_name     <NA>          NA
## 5 medication   brand_name     <NA>          NA
## 6 medication   brand_name     <NA>          NA
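The n-gram step described before the table can be sketched in base R. This is only an illustration of the idea, not the RWeka call used to build the dataset: the helper name `extract_ngrams` and its simple tokenization rules (lower-casing, splitting on non-alphanumeric characters) are assumptions.

```r
# Hypothetical helper: returns all 1-, 2-, and 3-grams from one text string.
# Base-R stand-in for the RWeka tokenizer; the tokenization rules are assumed.
extract_ngrams <- function(text, n_max = 3) {
  # Lower-case, turn anything that is not a letter, digit, or space into a space
  clean <- gsub("[^a-z0-9 ]", " ", tolower(text))
  tokens <- strsplit(trimws(clean), " +")[[1]]
  tokens <- tokens[nzchar(tokens)]
  grams <- character(0)
  for (n in seq_len(n_max)) {
    if (length(tokens) < n) break
    starts <- seq_len(length(tokens) - n + 1)
    grams <- c(grams, vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    ))
  }
  grams
}

# Made-up sentence for illustration only
grams <- extract_ngrams("AB-PINACA and mitragynine detected")
gram_counts <- table(grams)
```

Tabulating the n-grams of each issue in this way yields one frequency per distinct n-gram, so each distinct n-gram only needs to be sent to the entity recognizer once per issue.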
This dataset would be interesting to visualize for two reasons: 1) it has never been visualized before; and 2) gleaning insight from how such an immense number of medical entities with multiple attributes are mentioned across 26 years likely requires advanced visualization techniques.
With the MMWR dataset described above, one can visualize multiple kinds of information. First, through simple bar charts, like the ones above, one can see overall summary statistics of the top entities mentioned. These could be further refined and filtered by entity category, type, trait, year, quarter, month, and issue. One could also filter out entities below certain match scores. However, while these types of summary graphs might serve as good overview visualizations, users will likely want more granularity, such as how these entities were mentioned over time and how they correlate with other entities. Thus, displaying frequencies by year, quarter, month, and/or issue, as well as some type of correlation plot(s), should also be considered. Potential users likely represent the same target audience as MMWR Weekly readers: medical and public health professionals.
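The overview summary described here (filter by match score and year, then rank entities by total mentions) could look roughly like the following sketch. The column names follow the dataset preview, but the rows are made up and the helper `top_entities` is hypothetical.

```r
# Toy rows using the column names from the dataset preview; values are made up.
mmwr <- data.frame(
  year       = c(2016, 2016, 2017, 2017, 2017),
  text       = c("influenza", "fever", "influenza", "fever", "measles"),
  count      = c(10, 4, 7, 2, 5),
  main_score = c(0.90, 0.35, 0.80, 0.60, 0.75),
  stringsAsFactors = FALSE
)

# Hypothetical helper: total mentions per entity after filtering.
top_entities <- function(df, min_score = 0.5, years = range(df$year), n = 10) {
  # Keep rows above the match-score cutoff and inside the year window
  keep <- df$main_score >= min_score &
    df$year >= years[1] & df$year <= years[2]
  # Sum the per-issue counts for each entity
  totals <- aggregate(count ~ text, data = df[keep, ], FUN = sum)
  # Most frequent entities first
  totals <- totals[order(-totals$count), ]
  head(totals, n)
}

top <- top_entities(mmwr, min_score = 0.5)
```

The resulting data frame is already in the descending order needed for a ranked bar chart.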
The main challenge with this dataset is visualizing the overview summary statistics, the trends across years, and the correlations among the medical entities, all of which require separate types of visualizations. Additionally, given the sheer number of entities, visualizing all of them simultaneously would be neither a pleasant user experience nor useful for gleaning insight from the dataset. To address these challenges, visualizations of this dataset should incorporate interaction techniques, likely overview + detail and/or brushing + linking.
To implement our application, we plan to create a Shiny app with three different figures. The first and top-most figure will be a simple histogram with individual terms along the x-axis (e.g., fever, immunization, influenza) and the frequency of occurrence of those terms on the y-axis. The terms will be ordered by descending number of occurrences (i.e., the most frequently occurring term will appear at the left of the plot). Below this primary figure, we will display a line graph with the year along the x-axis and the frequency of occurrence of those terms on the y-axis. Each line will represent one term, with color and line style encoding the unique terms. Below this second figure, we plan to have a correlation matrix displaying the correlation coefficients between the displayed terms. The correlation coefficient will be encoded by color, using a diverging color scheme.
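One way the correlation matrix for the third figure could be computed is sketched below: pivot the counts into a period-by-term table, then correlate the term columns. The data are made up, and the choice of Pearson correlation on yearly totals is an assumption rather than a confirmed detail of the app.

```r
# Toy yearly counts for three terms; values are made up for illustration.
mmwr <- data.frame(
  year  = rep(2014:2017, times = 3),
  text  = rep(c("influenza", "fever", "measles"), each = 4),
  count = c(5, 8, 6, 9,   2, 4, 3, 5,   7, 3, 6, 2),
  stringsAsFactors = FALSE
)

# Rows = time periods, columns = terms; absent combinations become 0.
freq <- as.data.frame.matrix(xtabs(count ~ year + text, data = mmwr))

# Pearson correlation between every pair of term columns.
term_cor <- cor(freq)
```

Swapping `year` for `quarter`, `month`, or `issue` in the formula changes the time unit over which the coefficients are calculated.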
Our plan calls for several ways to display the data and several interaction mechanisms. In the left sidebar panel, we plan to use sliders and other selection controls to filter the data by year, number of words (the total number of terms displayed on the primary figure’s x-axis), and medical entity category (AWS Comprehend Medical parses terms into medications, medical conditions, and tests/treatments/procedures). The same panel will let the user choose the unit of time used to calculate the correlation coefficients (year, quarter, month, or issue) and set a threshold for the correlation coefficients to be displayed. Interaction between the figures will occur at several levels. To choose which terms from the primary figure are displayed in the second and third figures, we plan to implement a selection feature directly on the primary figure (i.e., the user can click and drag to select multiple bars, and this selection then filters the second and third figures). The second figure will allow the user to directly select a time period that filters all the data. Finally, selecting a square within the correlation matrix of the third figure will filter the second (middle) figure to display only those two terms.
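The click-and-drag selection on the primary figure boils down to mapping a brushed x-range back to the bars it covers. The sketch below shows that mapping as a plain R function; `terms_in_brush` is a hypothetical helper, and it assumes the bars are centred on integer positions 1..n in plotting order (as on a discrete ggplot2 axis).

```r
# Hypothetical helper: which bars fall inside a horizontal brush?
# Assumes bars are centred on integer positions 1..n, in plotting order.
terms_in_brush <- function(terms, xmin, xmax) {
  positions <- seq_along(terms)
  terms[positions >= xmin & positions <= xmax]
}

# Terms in the order they appear on the x-axis (made-up example)
ordered_terms <- c("influenza", "fever", "measles", "immunization")

# A brush dragged from just left of bar 2 to just right of bar 3
selected <- terms_in_brush(ordered_terms, xmin = 1.6, xmax = 3.4)
```

In a Shiny server function, `xmin` and `xmax` would come from the brush input registered on the plot (for example, the `xmin` and `xmax` fields of the input created by `plotOutput(..., brush = ...)`), and `selected` would then be used to subset the data behind the line graph and the correlation matrix.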
Sheet 1: Brainstorm
Sheet 2: Initial Design A
Sheet 3: Initial Design B
Sheet 4: Initial Design C
Sheet 5: Realization Design
Josh completed:
- Design sheets 1-4
- Implementation Plan
- Screenshots of live demo
- Future work ideas and sketches
- PowerPoint slides for presentation
Colby completed:
- Dataset description & key statistics
- Visualization tasks & requirements
- Design sheet 5
- RShiny app coding
- PowerPoint slides for presentation
Figure 1: Initial, default view of Shiny app upon first loading. Displays the main plot (a histogram) showing the cumulative frequency of extracted medical entities.
Figure 2: Detailed view of data selection tools.
Figure 3: Overview of main plot (top histogram) and two subplots. Selecting a subgroup of the medical entities from the top histogram creates subplots for those selected entities.
Figure 4: The first subplot is a line plot showing the frequency of the selected entities over the selected time period.
Figure 5: The second subplot is a correlation matrix displaying how highly correlated two terms are for the given set of parameters. (The main histogram is cut off in this screenshot.)
Figure 6: Detailed view showing brushing on each plot.
Figure 7: Brainstorming design sheet for future directions.
MMWR reports frequently include location data. This is generally in relation to outbreaks of certain infectious diseases that pose a particular threat to human health. The included location is typically at the county or state level, as this is usually the most granular level that public reporting of such outbreaks reaches. Understanding where these outbreaks occur across the United States, how they cluster, and how they change over time could potentially be valuable for public health professionals. After a brief brainstorming session, we feel such data could most accurately be represented by a proportional symbol map (i.e., circles or other shapes scaled to the size of a given value at a specific location). This is preferable to a choropleth map, because indicating a value by coloring a state or county can give inappropriate prominence to large, sparsely populated states like Wyoming or Montana. Therefore, we would propose a map of the United States overlaid with circles at the locations of outbreaks. The number of cases would be indicated by the size of the circle. The color of the circles could be used to indicate different types of outbreaks (e.g., E. coli versus Salmonella infections). A major question (and potential limitation) for this approach is whether we could adequately represent temporality to show how an outbreak evolves over time.
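For the proportional symbol map, one common convention is to scale the circle's area, rather than its radius, to the number of cases, so that the radius grows with the square root of the value. The sketch below illustrates that scaling; the helper `symbol_radius`, the `max_radius` parameter, and the outbreak rows are all hypothetical.

```r
# Hypothetical helper: circle radius chosen so that AREA is proportional to cases.
# The largest outbreak gets max_radius (in arbitrary plotting units).
symbol_radius <- function(cases, max_radius = 20) {
  max_radius * sqrt(cases / max(cases))
}

# Made-up outbreak rows for illustration only.
outbreaks <- data.frame(
  state    = c("WY", "CA", "TX"),
  pathogen = c("E. coli", "Salmonella", "Salmonella"),
  cases    = c(25, 400, 100),
  stringsAsFactors = FALSE
)
outbreaks$radius <- symbol_radius(outbreaks$cases)
```

With area-based scaling, an outbreak with four times as many cases is drawn with twice the radius, which avoids visually exaggerating large outbreaks.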